Display machine information:

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-redhat-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.29   R6_2.5.1        jsonlite_1.7.2  magrittr_2.0.1 
##  [5] evaluate_0.14   stringi_1.7.6   rlang_1.0.1     cli_3.1.0      
##  [9] rstudioapi_0.13 jquerylib_0.1.4 bslib_0.3.1     rmarkdown_2.11 
## [13] tools_3.6.0     stringr_1.4.0   xfun_0.29       yaml_2.2.1     
## [17] fastmap_1.1.0   compiler_3.6.0  htmltools_0.5.2 knitr_1.37     
## [21] sass_0.4.0

Load database libraries and the tidyverse frontend:

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(miceRanger))

Q1. Missing data

Through the Shiny app developed in HW3, we observe abundant missing values in the MIMIC-IV ICU cohort we created. In this question, we use multiple imputation to obtain a data set without missing values.

  1. Read following tutorials on the R package miceRanger for imputation: https://github.com/farrellday/miceRanger, https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html.

    A more thorough book treatment of the practical imputation strategies is the book Flexible Imputation of Missing Data by Stef van Buuren.

  2. Explain the jargon MCAR, MAR, and MNAR.

Reference: * Concepts of MCAR, MAR and MNAR * Concepts in incomplete data

  1. Explain in a couple of sentences how the Multiple Imputation by Chained Equations (MICE) work.

MICE imputes missing data in a dataset (it should be dependent variable) through an iterative series of predictive models. In each iteration, each specified variable in the dataset is imputed using the other variables in the dataset. These iterations should be run until it appears that convergence has been met and all specified variables have been imputed.

Step. 1 Replace the missing value with the mean value observed in the data.

Step. 2 The random forest (or linear regression or Predictive Mean Matching) is used to predict each variable with other vairables and the correlation between some variables will improve.

Step. 3 This process is continued until all specified variables have been imputed. The MICE finished and the correlation between some variables will be much closer to original data.

  1. Perform a data quality check of the ICU stays data. Discard variables with substantial missingness, say >5000 NAs. Replace apparent data entry errors by NAs.
icu_stay <- readRDS("../hw3/mimiciv_shiny/icu_cohort.rds")
icu_stay$thirty_day_mort <- 
  ifelse(is.na(icu_stay$thirty_day_mort) == T, 0, icu_stay$thirty_day_mort)

summary(icu_stay)
##    subject_id           intime                       hadm_id        
##  Min.   :10000032   Min.   :2110-01-11 10:16:06   Min.   :20000094  
##  1st Qu.:12483760   1st Qu.:2132-11-03 21:59:53   1st Qu.:22478418  
##  Median :14991992   Median :2152-09-25 12:29:44   Median :24948788  
##  Mean   :14990135   Mean   :2152-10-26 05:18:41   Mean   :24976352  
##  3rd Qu.:17497349   3rd Qu.:2172-10-29 04:56:08   3rd Qu.:27477545  
##  Max.   :19999987   Max.   :2211-05-01 06:59:19   Max.   :29999828  
##                                                                     
##     stay_id         first_careunit     last_careunit     
##  Min.   :30000153   Length:53150       Length:53150      
##  1st Qu.:32466765   Class :character   Class :character  
##  Median :34987382   Mode  :character   Mode  :character  
##  Mean   :34984425                                        
##  3rd Qu.:37487830                                        
##  Max.   :39999810                                        
##                                                          
##     outtime                         los               gender         
##  Min.   :2110-01-12 17:17:47   Min.   :  0.00125   Length:53150      
##  1st Qu.:2132-11-06 22:32:25   1st Qu.:  1.08479   Class :character  
##  Median :2152-09-28 07:17:57   Median :  1.87642   Mode  :character  
##  Mean   :2152-10-29 13:58:59   Mean   :  3.36132                     
##  3rd Qu.:2172-10-30 21:16:16   3rd Qu.:  3.53770                     
##  Max.   :2211-05-10 22:51:06   Max.   :110.23228                     
##                                                                      
##    anchor_age     anchor_year   anchor_year_group       dod            
##  Min.   :18.00   Min.   :2109   Length:53150       Min.   :2110-01-25  
##  1st Qu.:53.00   1st Qu.:2131   Class :character   1st Qu.:2133-04-29  
##  Median :65.00   Median :2151   Mode  :character   Median :2153-02-15  
##  Mean   :63.51   Mean   :2151                      Mean   :2153-06-18  
##  3rd Qu.:77.00   3rd Qu.:2171                      3rd Qu.:2173-09-20  
##  Max.   :91.00   Max.   :2207                      Max.   :2211-01-17  
##                                                    NA's   :45232       
##    admittime                     dischtime                  
##  Min.   :2110-01-11 10:14:00   Min.   :2110-01-15 17:31:00  
##  1st Qu.:2132-11-01 01:41:45   1st Qu.:2132-11-11 14:40:00  
##  Median :2152-09-24 06:45:00   Median :2152-10-03 05:40:00  
##  Mean   :2152-10-25 04:22:49   Mean   :2152-11-03 14:27:00  
##  3rd Qu.:2172-10-27 23:30:45   3rd Qu.:2172-11-07 08:56:15  
##  Max.   :2211-05-01 06:57:00   Max.   :2211-05-12 17:54:00  
##                                                             
##    deathtime                   admission_type     admission_location
##  Min.   :2110-01-25 09:40:00   Length:53150       Length:53150      
##  1st Qu.:2132-07-10 19:30:00   Class :character   Class :character  
##  Median :2152-02-26 23:34:00   Mode  :character   Mode  :character  
##  Mean   :2152-05-25 05:05:24                                        
##  3rd Qu.:2172-05-04 00:07:30                                        
##  Max.   :2211-01-17 12:34:00                                        
##  NA's   :47642                                                      
##  discharge_location  insurance           language         marital_status    
##  Length:53150       Length:53150       Length:53150       Length:53150      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   ethnicity           edregtime                     edouttime                  
##  Length:53150       Min.   :2110-01-11 21:42:00   Min.   :2110-01-12 00:54:00  
##  Class :character   1st Qu.:2133-09-01 09:15:00   1st Qu.:2133-09-01 15:55:45  
##  Mode  :character   Median :2153-05-28 01:29:30   Median :2153-05-28 06:58:00  
##                     Mean   :2153-07-10 17:24:31   Mean   :2153-07-10 23:17:28  
##                     3rd Qu.:2173-07-23 10:55:15   3rd Qu.:2173-07-23 17:29:15  
##                     Max.   :2211-05-01 06:07:00   Max.   :2211-05-01 09:57:00  
##                     NA's   :19618                 NA's   :19618                
##  hospital_expire_flag    age_hadm        hematocrit      magnesium    
##  Min.   :0.0000       Min.   : 18.00   Min.   : 0.00   Min.   :0.000  
##  1st Qu.:0.0000       1st Qu.: 54.00   1st Qu.:22.60   1st Qu.:1.500  
##  Median :0.0000       Median : 66.00   Median :26.80   Median :1.700  
##  Mean   :0.1038       Mean   : 64.47   Mean   :27.61   Mean   :1.721  
##  3rd Qu.:0.0000       3rd Qu.: 78.00   3rd Qu.:32.10   3rd Qu.:1.900  
##  Max.   :1.0000       Max.   :102.00   Max.   :61.00   Max.   :9.200  
##                                        NA's   :699     NA's   :857    
##      sodium         chloride        calcium         WBC_count      
##  Min.   : 67.0   Min.   : 39.0   Min.   : 0.000   Min.   :  0.000  
##  1st Qu.:131.0   1st Qu.: 94.0   1st Qu.: 7.400   1st Qu.:  4.500  
##  Median :135.0   Median : 98.0   Median : 7.900   Median :  6.200  
##  Mean   :133.9   Mean   : 97.3   Mean   : 7.822   Mean   :  6.836  
##  3rd Qu.:137.0   3rd Qu.:101.0   3rd Qu.: 8.400   3rd Qu.:  8.200  
##  Max.   :173.0   Max.   :142.0   Max.   :43.000   Max.   :471.700  
##  NA's   :683     NA's   :690     NA's   :1842     NA's   :720      
##   bicarbonate      creatinine        potassium        glucose      
##  Min.   : 2.00   Min.   : 0.0000   Min.   :0.700   Min.   :   0.0  
##  1st Qu.:18.00   1st Qu.: 0.5000   1st Qu.:3.200   1st Qu.:  75.0  
##  Median :21.00   Median : 0.7000   Median :3.500   Median :  87.0  
##  Mean   :20.21   Mean   : 0.8932   Mean   :3.495   Mean   :  89.5  
##  3rd Qu.:23.00   3rd Qu.: 1.0000   3rd Qu.:3.800   3rd Qu.:  99.0  
##  Max.   :46.00   Max.   :17.9000   Max.   :9.400   Max.   :1091.0  
##  NA's   :700     NA's   :680       NA's   :682     NA's   :707     
##  body_temperature systolic_non_invasive_blood_pressure   heart_rate        
##  Min.   :-99.90   Min.   :-69.00                       Min.   :-241395.00  
##  1st Qu.: 96.70   1st Qu.: 79.00                       1st Qu.:     54.00  
##  Median : 97.40   Median : 90.00                       Median :     62.00  
##  Mean   : 96.53   Mean   : 89.34                       Mean   :     57.09  
##  3rd Qu.: 97.80   3rd Qu.:100.00                       3rd Qu.:     72.00  
##  Max.   :104.00   Max.   :246.00                       Max.   :    150.00  
##  NA's   :820      NA's   :512                          NA's   :1           
##  mean_non_invasive_blood_pressure respiratory_rate thirty_day_mort  
##  Min.   :-22767.00                Min.   : 0.000   Min.   :0.00000  
##  1st Qu.:    48.00                1st Qu.: 7.000   1st Qu.:0.00000  
##  Median :    57.00                Median :10.000   Median :0.00000  
##  Mean   :    56.15                Mean   : 9.403   Mean   :0.04709  
##  3rd Qu.:    65.00                3rd Qu.:12.000   3rd Qu.:0.00000  
##  Max.   :   162.00                Max.   :37.000   Max.   :1.00000  
##  NA's   :533                      NA's   :37
#From the summary table, we decide to drop the variables with substantial missingness, say >5000 NAs
#drop "deathtime", "edregtime", "edouttime", "dod"
icu_stay <- icu_stay %>% 
  select(-c("deathtime", "edregtime", "edouttime", "dod"))

#replace lab and vital measurements to NA
for (var in colnames(icu_stay[,24:38])){
 icu_stay[[var]][icu_stay[[var]] %in% boxplot.stats(icu_stay[[var]])$out] <- NA
}

colSums(is.na(icu_stay[,24:38]))
##                           hematocrit                            magnesium 
##                                  879                                 1617 
##                               sodium                             chloride 
##                                 2383                                 2582 
##                              calcium                            WBC_count 
##                                 3213                                 2733 
##                          bicarbonate                           creatinine 
##                                 2946                                 4137 
##                            potassium                              glucose 
##                                 1947                                 3959 
##                     body_temperature systolic_non_invasive_blood_pressure 
##                                 3515                                 2284 
##                           heart_rate     mean_non_invasive_blood_pressure 
##                                 2849                                 1845 
##                     respiratory_rate 
##                                  531
  1. Impute missing values by miceRanger (request \(m=3\) data sets). This step is computational intensive. Make sure to save the imputation results as a file. Hint: Setting max.depth=10 in the miceRanger function may cut some computing time.
seqTime <- system.time(
  miceObj <- miceRanger(
      icu_stay
    , m=3
    , returnModels = FALSE
    , verbose=FALSE
    , max.depth=10
  )
)
miceObj
miceObj %>% save(file = "../hw4/miceObj.RData")
completeData(miceObj)[1] %>% write_rds(file = "../hw4/icu_cohort_1.rds")
completeData(miceObj)[2] %>% write_rds(file = "../hw4/icu_cohort_2.rds")
completeData(miceObj)[3] %>% write_rds(file = "../hw4/icu_cohort_3.rds")
  1. Make imputation diagnostic plots and explain what they mean.
plotDistributions(miceObj,vars='allNumeric')

plotDistributions

From the plot, we can see the imputed distributions compared to the original distribution for each variable. The red line is the density of the original, nonmissing data while the little black line are the density of the imputed values in each of imputed datasets. If they are not matched up, then it tells us that it was not Missing Completely at Random (MCAR). Thus, we can tell that only magnesium, calcium, glucose, body_temperture, heart_rate, and respiratory_rate look like MCAR.

plotCorrelations(miceObj,vars='allNumeric')

plotCorrelations

From the plot of Convergence of Correlation, we can visualize how values between datasets converged over the iterations.

plotVarConvergence(miceObj,vars='allNumeric')

plotVarConvergence

From the plot of Center and Dispersion Convergence, we would like to make sure whether dataset converge to the true theoretical mean. It doesn’t look like this dataset had a convergence issue.

plotModelError(miceObj,vars='allNumeric')

plotModelError

From the plot above, it looks like the variables were imputed with a reasonable degree of accuracy. However, for the respiratory_rate and mean_non_invasive_blood_pressure did not do well on the r-squared.

  1. Choose one of the imputed data sets to be used in Q2. This is not a good idea to use just one imputed data set or to average multiple imputed data sets. Explain in a couple of sentences what the correct Multiple Imputation strategy is.

For the correct Multiple Imputation strategy, we have to apply statistical tests in each imputed dataset and then pool the results to obtain summary estimates. For example, the descriptive statistics include pooling means and standard deviations; the difference in mean includes pooling independent t-tests; the relationship between the variables includes pooling logistic, Cox, or linear regression models. Thus, this is not a good idea to use just one imputed data set or to average multiple imputed data sets, but we can analyze each dataset and then pool their result together which is a more correct Multiple Imputation strategy.

Data analysis after Multiple Imputation

Q2. Predicting 30-day mortality

Develop at least two analytic approaches for predicting the 30-day mortality of patients admitted to ICU using demographic information (gender, age, marital status, ethnicity), first lab measurements during ICU stay, and first vital measurements during ICU stay. For example, you can use (1) logistic regression (glm() function in base R or keras), (2) logistic regression with lasso penalty (glmnet or keras package), (3) random forest (randomForest package), or (4) neural network (keras package).

  1. Partition data into 80% training set and 20% test set. Stratify partitioning according the 30-day mortality status.
library(rsample)
icu_cohort_2 <- readRDS("~/biostat-203b-2022-winter/hw4/icu_cohort_2.rds")

#Data split
set.seed(12345)
icu_split <- initial_split(icu_cohort_2$Dataset_2, prop = 0.8
                           , strata = thirty_day_mort)
icu_train <- icu_split %>% training()
icu_test <- icu_split %>% testing()
rm(icu_cohort_2)
rm(icu_split)
  1. Train the models using the training set.
#Logistics model
model1 <- glm(thirty_day_mort ~ ., data = icu_train
              , family = binomial(link="logit"))

icu_test$first_careunit = as.factor(icu_test$first_careunit)
icu_test$last_careunit = as.factor(icu_test$last_careunit)
icu_test$gender = as.factor(icu_test$gender)
icu_test$anchor_year_group = as.factor(icu_test$anchor_year_group)
icu_test$admission_type = as.factor(icu_test$admission_type)
icu_test$admission_location = as.factor(icu_test$admission_location)
icu_test$discharge_location = as.factor(icu_test$discharge_location)
icu_test$insurance = as.factor(icu_test$insurance)
icu_test$language = as.factor(icu_test$language)
icu_test$marital_status = as.factor(icu_test$marital_status)
icu_test$ethnicity = as.factor(icu_test$ethnicity)

icu_train$first_careunit = as.factor(icu_train$first_careunit)
icu_train$last_careunit = as.factor(icu_train$last_careunit)
icu_train$gender = as.factor(icu_train$gender)
icu_train$anchor_year_group = as.factor(icu_train$anchor_year_group)
icu_train$admission_type = as.factor(icu_train$admission_type)
icu_train$admission_location = as.factor(icu_train$admission_location)
icu_train$discharge_location = as.factor(icu_train$discharge_location)
icu_train$insurance = as.factor(icu_train$insurance)
icu_train$language = as.factor(icu_train$language)
icu_train$marital_status = as.factor(icu_train$marital_status)
icu_train$ethnicity = as.factor(icu_train$ethnicity)

#Random Forest
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
fit <- randomForest(as.factor(thirty_day_mort) ~ ., data=icu_train, ntree = 300)
  1. Compare model prediction performance on the test set.
# Probability and class preds
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
res_preds <- predict(model1, icu_test, type = 'response')
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
pred <- as.numeric(res_preds > 0.5)
pred2 <- as.factor(pred)
status_test <- as.numeric(icu_test$thirty_day_mort)
a <- confusionMatrix(pred2, as.factor(icu_test$thirty_day_mort))

## confusion matrix
fourfoldplot(a$table, color = c("cyan", "pink"),
             conf.level = 0, margin = 1
             , main = "Confusion Matrix for Logistics model")

res_preds1 <- predict(fit, icu_test, type = 'response')
status_test <- as.numeric(icu_test$thirty_day_mort) - 1
b <- confusionMatrix(res_preds1, as.factor(icu_test$thirty_day_mort))

## confusion matrix
fourfoldplot(b$table, color = c("cyan", "pink"),
             conf.level = 0, margin = 1
             , main = "Confusion Matrix for Random Forest")

Accuracy_log <- a$overall[1]
Accuracy_rf <- b$overall[1]
Accuracy_log
## Accuracy 
## 0.956444
Accuracy_rf
##  Accuracy 
## 0.9570085